

PARALLEL



BORAT

# Par Lab: Where we ended up

I N G

Ρυτ

Krste Asanovic, Ras Bodik, Jim Demmel, Armando Fox, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, Dave Patterson, Koushik Sen, David Wessel, and Kathy Yelick UC Berkeley

> Par Lab End-of-Project Party May 30, 2013



### Par Lab Timeline





### Electrical Engineering and Initial Par Lab "Lasagna" Stack

Computer Sciences

#### Easy to write correct programs that run efficiently on manycore

BERKELEY PAR LAB



# **Dominant Application Platforms**

Controversial in 2005,

"Obvious" in 2013

BERKELEY

;-) 4

026

Laptop/Handheld ("Mobile Client")

Par Lab focuses on mobile clients

Data Center or Cloud ("Cloud")

RAD Lab/AMPLab focuses on Cloud

Both together ("Client+Cloud")

ParLab-AMPLab collaborations





- Original predictions, 2x cores every 2 years
  - 256 cores by 2013
- Reality was <2+ cores every 2 years</p>
  - 8-16 cores in 2013
- But real growth was in SIMD performance
  - Wider, more capable SIMD units on multicore
  - GP-GPUs

### Recent GPUs have up to 2,048 vector lanes!





- Many of the parallel patterns are amenable to data-parallel execution
- Despite limited memory capacity and cumbersome programming model, GPUs were surprisingly effective on wide range of apps
  - Easier to get higher speedups than multicore
  - Apps developers voted with their feet
- Prediction: Better CPU SIMD extensions and integrated GPUs having to use same memory system will blur/narrow CPU/GPU difference



- We architected our bare minimum requirements for accurate performance/energy counters
- Bad news: In 2013, commercial processors still don't meet our bare minimum
- Good news: Energy counters have started appearing
- Prediction: Counters should be given higher priority but will continue to be "unloved" parts of future architectures

#### Electrical Engineering and **Computer Sciences**

### **RAMP Gold**



- Rapid accurate simulation of manycore architectural ideas using FPGAs
- Initial version models 64 cores of SPARC v8 with shared memory system on \$750 board
- Hardware FPU, MMU, boots our OS and Par Lab stack!



|                                                      | Cost            | Performance<br>(MIPS) | Time per 64 core simulation |  |  |  |  |  |  |
|------------------------------------------------------|-----------------|-----------------------|-----------------------------|--|--|--|--|--|--|
| Software<br>Simulator                                | \$2,000         | 0.1 - 1               | 250 hours                   |  |  |  |  |  |  |
| RAMP Gold                                            | \$2,000 + \$750 | 50 - 100              | 1 hour                      |  |  |  |  |  |  |
| Download at: https://sites.google.com/site/rampgold/ |                 |                       |                             |  |  |  |  |  |  |



RAMP Gold design forms core of DIABLO "Datacenter-In-A-Box at LOw cost"

Computer Sciences

- Execution-driven model of entire 2,000-node datacenter including network switches
- Now, generate FPGA emulations (FAME-0) of own RISC-V processors from Chisel code
- Future, developing techniques for automatic generation of efficient FPGA models from RTL
  - Chisel automatically generating higher FAME
  - New DREAMER emulation architecture



# **RISC-V** ISA



- A new clean-slate open-source ISA to support research and education
- Ports of Tessellation, Akaros, Linux OS, gcc, LLVM,...
- Multiple implementations including "Rocket" inorder research core, plus "Sodor" family of educational processors
- New vector-thread architecture designs
- FPGA emulations + tapeouts of real chips
- To be released soon at: http://www.riscv.org

# Chisel: Constructing Hardware In a Scala Embedded Language



- Embed a hardware-description language in Scala, using Scala's extension facilities
- A hardware module is just a data structure in Scala
- Different output routines can generate different types of output (C, FPGA-Verilog, ASIC-Verilog) from same hardware representation
- Full power of Scala for writing hardware generators
  - Object-Oriented: Factory objects, traits, overloading etc
  - Functional: Higher-order funcs, anonymous funcs, currying
  - Compiles to JVM: Good performance, Java interoperability

Download from http://chisel.eecs.berkeley.edu

#### EECS Electrical Engineering and Computer Sciences "Agile Hardware" Development





Multiple 28nm and 45nm GHz-class processor tapeouts





- Pursued our original approach of very thin hypervisor layer managing partitions
- Many ideas swirling in early days of project, concrete implementations on real x86 hardware and RAMP Gold helped provide focus
- OS group split into cloud team that moved to AMPLab (Akaros) and client team in Par Lab (Tessellation)

### Electrical Engineering and Computer Sciences + 2-Level Scheduling





- 1<sup>st</sup> level: OS determines coarse-grain allocation of resources to jobs over space and time
- 2<sup>nd</sup> level: Application schedules component tasks onto available "harts" (hardware thread contexts) using Lithe

### **PACORA: Resource Management using Convex Optimization**



- Each process receives a vector of basic resources dedicated to it
  - e.g., fractions of cores, cache slices, memory pages, bandwidth
- Allocate minimum for QoS requirements
- Allocate remaining to meet some system-level objective
  - e.g., best performance, lowest energy, best user experience







- *Merged* resource and computation abstraction.
- More accurate resource abstraction.
- Let apps provide own computation abstractions



- Lithe is an ABI to allow application components to co-operatively share hardware threads.
- Each component is free to map computational to hardware threads in any way they see fit
  - No mandatory thread or task abstractions
- Components request but cannot demand harts, and must yield harts when blocked or finished with task



### Types of Programming (or "types of programmer")



18

#### Example Languages

### **Example Activities**

Max/MSP, SQL, **Domain-Level** CSS/Flash/Silverlight, (No formal CS) Matlab, Excel Builds app with DSL and/or by customizing app framework

| Productivity<br>(So Where & ho | Python/Ruby/Lua<br>w to make para<br>Scala | Uses<br>Ilelism<br>Application<br>frameworks (or apps)                        |
|--------------------------------|--------------------------------------------|-------------------------------------------------------------------------------|
| Efficiency-Level<br>(MS in CS) | Java/C#<br>C/C++/FORTRAN<br>assembler      | Uses hardware/OS<br>primitives, builds<br>programming<br>frameworks (or apps) |
| Hardware/OS                    |                                            | ovides hardware<br>mitives and OS services                                    |





In a new general-purpose parallel language?

- An oxymoron?
- Won't get adopted
- Most big applications written in >1 language

### Par Lab bet on Computational and Structural Patterns at all levels of programming (Domain thru Efficiency)

- Patterns provide a good vocabulary for domain experts
- Also comprehensible to efficiency-level experts or hardware architects
- Lingua franca between the different levels in Par Lab





### How do compelling apps relate to 13 dwarfs?

| Apps               | ed    | U<br>U |   | es    |    |     |     |               | MA    |        |       |         |
|--------------------|-------|--------|---|-------|----|-----|-----|---------------|-------|--------|-------|---------|
| Computation        | Embed | SPEC   | B | Games | ML | НРС | CAD | ぞうグ<br>Health | Image | Speech | Music | Browser |
| Graph Algorithms   |       |        |   |       |    |     |     |               |       |        |       |         |
| Graphical Models   |       |        |   |       |    |     |     |               |       |        |       |         |
| Backtrack / B&B    |       |        |   |       |    |     |     |               |       |        |       |         |
| Finite State Mach. |       |        |   |       |    |     |     |               |       |        |       |         |
| Circuits           |       |        |   |       |    |     |     |               |       |        |       |         |
| Dynamic Prog.      |       |        |   |       |    |     |     |               |       |        |       |         |
| Unstructured Grid  |       |        |   |       |    |     |     |               |       |        |       |         |
| Structured Grid    |       |        |   |       |    |     |     |               |       |        |       |         |
| Dense Matrix       |       |        |   |       |    |     |     |               |       |        |       |         |
| Sparse Matrix      |       |        |   |       |    |     |     |               |       |        |       |         |
| Spectral (FFT)     |       |        |   |       |    |     |     |               |       |        |       |         |
| Monte Carlo        |       |        |   |       |    |     |     |               |       |        |       |         |
| N-Body             |       |        |   |       |    |     |     |               |       |        |       |         |

# Computer Sciences Pattern Language (OPL-2010)



#### **EECS** Electrical Engineering and Computer Sciences Mapping Patterns to Hardware





### Only a few types of hardware platform







### Electrical Engine and Computed fine reasonable low-level mappings







### aka. "Stovepipes"



Allow maximum efficiency and expressibility in specializers by avoiding mandatory intermediary layers



### Electrical Engineering and Computer Sciences Just-In Time Specialization"



- SEJITS bridges productivity and efficiency layers through specializers embedded in modern high-level productivity language (Python, Ruby)
  - Embedded "specializers" use language facilities to map high-level pattern to efficient low-level code (at run time, install time, or development time)
  - Specializers can incorporate or package autotuners

### **Two ParLab SEJITS projects:**

- Copperhead: Data-parallel subset of Python, development continuing at NVIDA
- Asp: "<u>A</u>sp is <u>S</u>EJITS in <u>Python</u>" general specializer framework
  - Provide functionality common across different specializers



### **SEJITS In A Nutshell**







### **Current Par Lab Stack**





### Supporting QoS inside Apps

Electrical Engineering and Computer Sciences







### Communication-Avoiding Algorithms



- Past algorithms: FLOPs expense, Moves cheap
- New theory: proves lower bounds on data movement; both serial (memory hierarchy) and parallel data movement
- New practice: codes achieve lower bound and speedups
- Widely applicable: all linear algebra, Health app...



Idea #1: read a piece of a sparse matrix (= graph) into fast memory and take multiple steps of higher-level algorithm



Idea #2: replicate data (including lefthand side arrays, as in C in C=A\*B) and compute partial results, reduce later

### A few examples of speedups



### Matrix multiplication

Computer Sciences

- Up to 12x on IBM 64K-core BG/P for n=8K; 95% less communication
- QR decomposition (used in least squares, data mining, ...)
  - Up to 8x on 8-core dual-socket Intel Clovertown, for 10M x 10
  - Up to 6.7x on 16-proc. Pentium III cluster, for 100K x 200
  - Up to 13x on Tesla C2050 / Fermi, for 110k x 100
  - "infinite speedup" for out-of-core on PowerPC laptop
    - LAPACK thrashed virtual memory, didn't finish
- Eigenvalues of band symmetric matrices
  - Up to **17x** on Intel Gainestown, 8 core, vs MKL 10.0
- Iterative sparse linear equations solvers (GMRES)
  - Up to 4.3x on Intel Clovertown, 8 core
- N-body (direct particle interactions with cutoff distance)
  - Up to 10x on Cray XT-4 (Hopper), 24K particles on 6K procs.

Next: automatically xform code; new "HBL" theory just out!

### Autotuning: Computers, Rather than People Tune Code



- Autotuners are code generators plus search
- Avoids two unsolved compiler problems: dependence analysis and accurate performance models
- New: particles, stencils, graphs,... and manylane/core optimizations
- New roofline model to aid in performance understanding



Work by Williams, Oliker, Shalf, Madduri, Kamil, Im, Ethier,...

### Parallel Correctness: Testing and Debugging



### Some Results

**Computer Sciences** 

- Active Testing: for finding non-deterministic bugs such as data races, deadlocks, atomicity violations
  - Open-source [BSD-licensed] CalFuzzer for Java
  - Thrille for C/Pthreads and UPC
- Specification and assertion framework for parallelism correctness
  - Introduced Bridge Predicates
  - Nondeterministic Sequential Specification to separate parallel correctness
    from functional correctness
- Concurrit: A testing framework for concurrent programs
  - JUnit or xUnit for concurrent programs
  - Applied to Chrome and Firefox browsers
- Concolic testing: for automated test input generation
  - Java
  - Javascript (ongoing)

### **Parallel Browser**

**Computer Sciences** 





- 2007 Vision: desktop-quality browsing on mobiles.
- **Now**: yes, but only 200 web pages / battery charge.
- \* 2007 Vision: browser as an app platform.
- **Now**: Google Glass is a browser app.





Parallelism improves responsiveness, energy use.

- 2007: parallel browser controversial
- 2013: Mozilla Servo & Google Blink browsers

Some of our results:

- First scalable parser (in Qualcomm browser)
- Synthesizer of parallel layout engines
- Parallel layout retrofitted to Safari via WebCL
- Collaboration with Mozilla on parallel Servo





<u>Synthesis</u>: search a huge space for a program that is semantically correct and high-performance

- alternative to classical AST-rewrite compilers
- search implemented as constraint solving

Some of our synthesizers:

- FTL: parallel layout engines
- Programming for ULP spatial manycore
- SQL query programming by demonstration

# **Music Application**

p buffers

receive- sndout1 receive- sndout2 receive- sndout3 receive- sndout5 receive- sndout5 receive- sndout6 receive- sndout7 receive- sndout8

lookup-table-maker



New user interfaces with pressure-sensitive multi-touch gestural interfaces

1.00

Gestures

Electrical Engineering and Computer Sciences

120-channel speaker array

Programmable virtual instrument and audio processing

Audio

# More Applications: "Beating down our doors!"



- 5 Original Apps: Parallel Browser (Ras Bodik), Music (David Wessel), Speech (Nelson Morgan), Health (Tony Keavney), Image Retrieval (Kurt Keutzer)
- New external application collaborators:
- Pediatric MRI (Michael Lustig, Shreyas Vassanwala @Stanford)
- Multimedia and Speech (Dorothea Kolossa @TU Berlin)
- Computer Vision (Jitendra Malik, Thomas Brox)
- Computational Finance (Matthew Dixon @UCD)
- Natural Language Translation (Dan Klein)
- Programming multitouch interfaces (Maneesh Agrawala)
- Protein Docking (Henry Gabb, Intel)

Computer Sciences

### **Pediatric MRI**



Typical exam ~ 1 hour Motion blurs the images Scanner is a small loud tunnel

Electrical Engineering and Computer Sciences

Difficult for children to stay still!

Traditional Solution: Anesthesia

Compressed Sensing reduces each scan to 15 seconds



But takes too long (hours) to reconstruct image



#### Compressed Sensing for Pediatric MRI



- Image reconstruction from 1-2 hours down to < 1 min
- In use in clinical trials





**Today's Demos** 





# Demo: The Parallel Meeting Diarist

#### Gerald Friedland International Computer Science Institute (ICSI)



Work together with primarily: Katya Gonina **David Sheffield Adam Janin Brian Hill Jike Chong Nelson Morgan Kurt Keutzer** Ganapati S. Mishali N.

BERKELEY PAR LAB



#### **Components of the Meeting Diarist**



Created a fully integrated, very fast meeting diarist SEJITized that is currently tech-transfered to Intel.

BERKELEY PAR LAB





|            | Before Par Lab | After Par Lab |
|------------|----------------|---------------|
| "who spoke | ~10k LOC       | ~100 LOC      |
| when"      | 0.3 x RT       | 250 x RT      |
| "what was  | ~100k LOC      | ~ 5k LOC      |
| said"      | 0.1 x RT       | 1 x RT        |

Created a fully integrated, very fast meeting diarist using SEJITs:

Online=Offline processing for "who spoke when"

Diarization used for BIG DATA video processing

Tech-transfer to Intel





- 193 Conference Papers
- 28 Journal Papers
- 94 Technical Reports
- 12+ Best Paper Awards
- Berkeley View TR, >1,300 citations
  Par Lab papers, >8,000 citations





- Par Lab summer bootcamps
  - 2009-2012: >1300 attendees including >300 from industry
- CS61C Reworked intro architecture class with parallelism
  - 480 students in Fall 2012 semester
  - Revised 5<sup>th</sup> edition of undergrad text (used at 400 universities)
- CS152 Undergrad Architecture using Chisel processors
- CS164 Students design and own DSLs, implement a browser
- CS194 Undergrad parallel patterns class
  - 3<sup>rd</sup> offering, co-taught with Tim Mattson, Intel
  - 3 posters here from successful undergrad projects
- CS250 Graduate VLSI Design using Chisel/Agile Hardware
- CS267 Graduate parallel computing class
  - Added material on dwarfs, patterns, autotuning, apps
  - Homeworks ported to .net with Microsoft
  - NSF-funded MOOC XSEDE launched spring 2013





#### PhD students

- ✤ 91 total PhD participants,
- 23 of which graduated by 2013
- **MS Students**
- ✤ 35 total MS participants,
- 13 of which graduated by 2013
   Post-Docs
- 5 current

# Par Lab Book



#### 18 chapters

Electrical Engineering and <u>Computer Sc</u>iences

- Overview + 1-2 research papers
- **☆** ≈ 600 pages
- Expected by June 30
  - Amazon Ebook \$0.99

Print book signup page

# The Berkeley Par Lab

Progress in the Parallel Computing Landscape



Edited by David Patterson, Dennis Gannon, Michael Wrinn



### Multiple Follow-On Projects Underway



- OS and Music work continuing in new SWARM Lab, programming the "swarm" of environmental devices
  - http://swarmlab.eecs.berkeley.edu
- Software synthesis, Correctness -> Chaperone, ExCape NSF Expedition
- ASPIRE: patterns, communication-avoiding algorithms, SEJITS, RISC-V, specialized architectures, Chisel, Agile Hardware http://aspire.eecs.berkeley.edu



**Computer Sciences** 

- BERKELEY PAR
- Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227).
- Additional support comes from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, Samsung, and Oracle/Sun.